

TECHNICAL  REPORT 


FAST  PROBABILISTIC  TECHNIQUES  FOR  DYNAMIC 
PARALLEL  ADDITION,  PARALLEL  COUNTING  AND 
THE  PROCESSOR  IDENTIFICATION  PROBLEM 

By 

Paul  G.  Spirakis 
November  1984 

Technical  Report  #144 
Ultracomputer  Note  #80 


NEW  YORK  UNIVERSITY 


Department  of  Computer  Science 
Courant  Institute  of  Mathematical  Sciences 

251  MERCER  STREET,  NEW  YORK,  N.Y.  10012 




Fast  Probabilistic  Techniques  for  Dynamic  Parallel  Addition,  Parallel 
Counting,  and  the  Processor  Identification  Problem 

Paul G. Spirakis^

Courant  Institute,  N.Y.U. 

251  Mercer  Street 

New  York,  N.Y.  10012 

Ultracomputer  Note  #80 

November,  1984 

ABSTRACT 

We consider here three problems in synchronous parallel computation,
with the common characteristic that the shared memory available to
the processors is restricted. The first is the problem of dynamic
parallel addition: n processors are given, each keeping a number
x_i, i = 1,...,n. We know that only m < n of the x_i's are nonzero, but
we don't know in advance which processors have the nonzero
numbers. We show how to compute the sum of the x_i's in O(log m)
expected parallel time and O(m) shared memory in the concurrent
read - concurrent write model of parallel computation, through a
probabilistic algorithm. We then consider the related problem of
processor identification: n processors are given, each keeping either a 0
or a 1. We want each processor at the end to know which are the
processors with the 1's. The shared memory available is very small
(say 1 location), and can store only O(n) size numbers. To solve this
problem, O(n) time is required in a parallel read - exclusive write (or
even concurrent write) PRAM, even if O(n) shared memory is
available.

We show how to do it in O(n/log n) parallel time, if we use a weak
version of a (more powerful) model of parallel computation, the so-called
paracomputer. Our algorithm is again probabilistic. For this algorithm,
we use combinatorial ideas developed by Erdős and Rényi [ER,63].
Finally, we show how to do approximate parallel counting of O(n) units
in O(1) parallel time, by using O(n/log n) shared memory and an
n-processor concurrent read - concurrent write machine.


^This  work  was  supported  in  part  by  the  NSF  Grant  MCS  83-00630. 


1.   Introduction 

We  consider  here  three  related  problems  in  synchronous  parallel 
computation.  The  models  of  parallel  machines  we  use  are  (a)  the  W-RAM 
(see  e.g.  [SV,80]  ,  [G,78]  )  and  (b)  the  paracomputer  (see  [S,80]  ).  Both 
models  assume  the  presence  of  a  (potentially)  unlimited  number  of 
processors  with  (potentially  unlimited)  local  memory  in  each  processor.  In 
both  models,  processors  are  capable  of  doing  independent  probabilistic 
choices on a fixed input. (This was first used by [R,82a,b] and [V,83].) The
W-RAM is like the P-RAM of [W,79] and [FW,78] in the sense that different
processors  can  read  the  same  memory  location  at  the  same  time.  However, 
W-RAM  is  stronger  than  P-RAM,  because  it  allows  simultaneous  access  to 
the  same  memory  location  for  write  operations  (concurrent  read  -  concurrent 
write,  CRCW).  In  the  case  of  a  simultaneous  write  attempt,  exactly  one 
processor succeeds. We make no assumption about which one succeeds, but we
assume that the failed ones are notified. This can be easily implemented by
having processors read the result of the "write". The paracomputer model
also  allows  simultaneous  writes  to  the  same  memory  location,  by  permitting 
the  simultaneous  execution  of  many  fetch-and-add  (FA)  instructions  on  the 
same  memory  location.  (See,  e.g.  [GLR,80]  ).  The  effect  of  simultaneous 
actions  by  the  processors  is  as  if  the  actions  occurred  in  some  (unspecified) 
serial order. The semantics of an individual FA are as follows: Let v be a
shared memory address and e a value. Then, if processor P executes x ←
FA(v,e) alone (where x is a local variable of P), what happens is that (a) x is
assigned the contents [v] of v, and (b) [v] ← [v] + e is executed. The logical
order is first (a), then (b), but each FA takes constant time. The logical
effect of many FA operations on the same shared variable is as if these
operations occurred in some (unspecified) serial order. In the paracomputer,
the simultaneous execution of many FA's takes 1 step.
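
As an illustration of these semantics, the following Python sketch simulates simultaneous FA operations on one shared location by applying them in an arbitrary serial order. The names SharedCell, simultaneous_fa and weak_simultaneous_fa are hypothetical and are not taken from the cited literature; the weak variant simply rejects a step in which the FA's do not all use the same value e.

```python
class SharedCell:
    """One shared memory location v supporting fetch-and-add (FA)."""

    def __init__(self, value=0):
        self.value = value

    def fetch_and_add(self, e):
        # (a) fetch the old contents [v], then (b) [v] <- [v] + e
        old = self.value
        self.value = old + e
        return old


def simultaneous_fa(cell, increments, order=None):
    """Simulate one parallel step in which processor k issues FA(v, increments[k]).

    The effect is as if the operations happened in some serial order;
    'order' lets the caller pick that order (default: 0, 1, 2, ...).
    Returns a dict mapping each processor index to the value it fetched.
    """
    order = list(range(len(increments))) if order is None else list(order)
    fetched = {}
    for k in order:
        fetched[k] = cell.fetch_and_add(increments[k])
    return fetched


def weak_simultaneous_fa(cell, increments, order=None):
    """Weak variant (an assumption of this sketch): all increments must be equal."""
    if len(set(increments)) > 1:
        raise ValueError("weak model: all simultaneous FA's must use the same e")
    return simultaneous_fa(cell, increments, order)
```

Whatever serial order is chosen, the final contents of the cell are the old contents plus the sum of all increments; only the individually fetched values depend on the order.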

In  this  paper  we  use  a  weak  version  of  a  paracomputer:  simultaneous  FA 
are  allowed  only  if  all  FA's  try  to  update  v  by  using  the  same  value  e.  We  call 
this  a  weak-paracomputer. 

A realization of the paracomputer (called the ultracomputer, see [S,80])
is one of the few currently feasible general purpose parallel machines. In
foreseeable technologies, there is no shared memory, data items are stored in
local memories and a processor can receive or send only one data item per
unit time. With this in mind, a simulation of an exclusive read - exclusive
write machine (EREW) and that of a W-RAM or a paracomputer are almost
of the same cost (see [UW,84], [MV,83], [V,83b], [V,83a]). This justifies in
part the use of a W-RAM or stronger models for design of parallel
algorithms.

The first of the problems we consider is that of dynamic parallel
addition: n processors of a W-RAM are given a number x_i each, i = 1,...,n. It
is known that only m ≤ n of the x_i's are nonzero but processors do not know
in advance which processors have the nonzero numbers. We show how to
compute the sum of the x_i's in O(log m) expected parallel time and O(m)
shared memory. The problem has applications in the construction of reliable
arrays of sensors in robotics. To our knowledge, deterministic W-RAM
algorithms for addition have to take O(log n) parallel time when n processors
are used, and n numbers are to be added.

The second problem is that of processor identification: n processors of a
weak-paracomputer are given, each keeping either a 0 or a 1. Each of the
processors must find out which are the processors with the 1's. We assume
that each shared memory location can fit a number of at most O(n) size. We
show that there is an algorithm to solve the problem in O(n/log n) parallel
time and just one shared memory location! This parallel time is optimal for
the paracomputer. To show this, we use a nice technique due to Erdős and
Rényi [ER,63]. The solution of the same problem requires Ω(n) time in a
W-RAM, even if O(n) shared memory locations are available. We show how
to make the algorithm constructive, by a probabilistic technique.

The third problem we consider is the problem of approximate parallel
counting: n processors of a W-RAM are given a number x_i each, i = 1,...,n,
and x_i is either 0 or 1. Each processor wants to find whether the 1's are a
majority. (Processors do not know which processors contain 1's.) We provide
a probabilistic algorithm for answering a version of this problem, which uses
O(1) expected parallel time and O(n/log n) shared memory.

The version of the problem we solve assumes that the number of x_i's
equal to 1 is either > n/2 or < n/log n.

All our probabilistic techniques have the following strong property: If T
is the actual parallel time of our algorithm and E(T) is its expected value,
then Prob[T > kE(T)] ≤ n^{-c}, where k is any constant value, and c > 1
depends linearly on k and can be controlled by the algorithm implementer at
the expense of an additional constant fraction of the shared memory used.

2.    Dynamic  Parallel  Addition 

2.1.    The  Algorithm 

Let the array M represent the shared memory. Let a ≥ 2 be a positive
integer constant. Let each processor be equipped with a local variable TIME,
intended to keep the current parallel step. The algorithm has 3 stages.

Stage 1 (Initialization)

In one parallel step, processors initialize am + 2 shared memory
locations to zero, by executing: "Processor P_j writes a zero to M[j], if j ≤
am + 2." Then, they all execute TIME_j ← 0;

Stage 2 (Memory Marking)

Processor P_j:

IF x_j ≠ 0 then
BEGIN

(1) Select y equiprobably at random from {1,2,...,am}

(2) TIME_j ← TIME_j + 1

(3) Read M[y]; TIME_j ← TIME_j + 1

(4) If M[y] = 0 then write x_j into M[y].
Also, TIME_j ← TIME_j + 1

(5) If the "write" failed then

BEGIN
(5a) write TIME_j into M[am+1]
(5b) go to (1)
END
END
Comment: This part is executed by P_j with x_j = 0 and by "successful" P_j
with x_j ≠ 0.

(6) Read M[am+1] into a local variable R1

(7) Wait for 8 steps

(8) Read M[am+1] into a local variable R2.

(9) If R1 ≠ R2 then go to (6)

Comment: R1 = R2 means all processors with x_j ≠ 0 succeeded in
writing x_j in a shared memory location, different for each processor, among
M[1],...,M[am]. (If a processor was failing, the value of M[am+1] would
change.)
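
The memory-marking stage can be simulated sequentially, round by round, as in the following Python sketch. It is a simplification under stated assumptions: the function name memory_marking is hypothetical, the winner of a simultaneous write is picked at random (the model leaves it unspecified), and the termination test through M[am+1] and the TIME counters is replaced by the simulator's global view of which processors are still pending.

```python
import random


def memory_marking(x, a=2, rng=None):
    """Simulate Stage 2: each processor with x[j] != 0 claims its own cell.

    Returns (M, rounds) where M[1..a*m] holds the nonzero values
    (index 0 is unused) and rounds is the number of write attempts needed.
    """
    rng = rng or random.Random()
    m = sum(1 for v in x if v != 0)
    size = a * max(m, 1)
    M = [0] * (size + 1)
    pending = [j for j, v in enumerate(x) if v != 0]
    rounds = 0
    while pending:
        rounds += 1
        # Step (1): every pending processor picks a cell uniformly at random.
        choice = {j: rng.randint(1, size) for j in pending}
        # Step (3): all reads happen before any write of this round.
        contenders = {}
        for j, y in choice.items():
            if M[y] == 0:
                contenders.setdefault(y, []).append(j)
        # Step (4): exactly one writer per cell succeeds (arbitrary winner).
        winners = set()
        for y, js in contenders.items():
            w = rng.choice(js)
            M[y] = x[w]
            winners.add(w)
        # Step (5): the failed processors go back to step (1).
        pending = [j for j in pending if j not in winners]
    return M, rounds
```

For example, memory_marking([0, 5, 0, 0, 7, 3, 0, 0], a=2) places the values 5, 7 and 3 into three distinct cells among M[1],...,M[6] and leaves the other cells at zero.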

Stage 3 (Addition)

(Processor P_j is assigned to location M[j], 1 ≤ j ≤ am)

From this point on, processors P_j (where 1 ≤ j ≤ am) perform a
standard parallel addition of the numbers M[1],...,M[am]. In the ith parallel
step of the addition, processor P_j adds M[j] and M[j + 2^{i-1}] into M[j], for
j = k·2^i + 1, k = 0,1,2,...,am/2^i. (See e.g. [K,82] or [FW,78] on how to do
the parallel addition of am numbers by am processors in O(m) space and
O(log m) parallel time.)

2.2. Analysis of the Algorithm

It is easy to see that the algorithm presented uses O(m) shared memory
and performs the addition correctly, because, at the end of stage 2, the m
nonzero x_j's are placed one in each of m shared memory locations, and these
locations are among M[1],...,M[am]. The rest of these locations contain
zeros.
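
The Stage 3 addition over M[1],...,M[am] can be sketched in Python as a pairwise (tree) summation. This is a sequential simulation under assumptions of this sketch: the name tree_sum is hypothetical, cells are indexed from 0, and each pass of the outer loop stands for one parallel step in which the additions touch disjoint cells.

```python
def tree_sum(M):
    """Sum the cells of M by repeated pairwise addition.

    Returns (total, steps), where steps counts the simulated parallel steps.
    """
    cells = list(M)
    length = len(cells)
    stride = 1
    steps = 0
    while stride < length:
        # One parallel step: cell j absorbs cell j + stride.
        # Writers are multiples of 2*stride, so no cell is both read and written.
        for j in range(0, length - stride, 2 * stride):
            cells[j] += cells[j + stride]
        stride *= 2
        steps += 1
    return (cells[0] if cells else 0), steps
```

With am cells the loop runs about log2(am) times, which matches the O(log m) bound quoted for stage 3 when a is a constant.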

The time complexity depends crucially on the time of stage 2 (stage 1 takes
O(1) parallel time and stage 3 takes O(log m) parallel time). It is easy to see
that every time a processor P_j attempts to write its x_j, and if g ≤ m shared
memory locations are already "occupied", the competitors of P_j are m-g-1.
Even if all of them manage to select different memory locations which were
not occupied previously, the maximum number of locations that P_j must
"avoid" is g + m-g-1 = m-1. So, P_j will succeed with probability at least
[am - (m-1)]/am ≥ (a-1)/a in each trial (and this holds for every P_j). In the
full paper we prove that:

Theorem 1. The average number of attempts required for all m processors to
succeed is O(log m). The probability that the parallel time exceeds p log m is
≤ m^{-p log a + 1} (and can be made arbitrarily small).

Proof sketch: The probability that there exists a processor which continues
failing for ≥ p log m rounds is

≤ m (1/a)^{p log m} ≤ m^{-p log a + 1}.

Corollary 1. Our algorithm performs dynamic parallel addition in O(log m)
parallel expected time. It uses O(m) shared memory. The probability that the
parallel time of the algorithm exceeds p log m is ≤ m^{-p log a + 1}.
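
To get a feeling for the size of this bound, the following Python sketch evaluates the union bound m (1/a)^{p log m} and its closed form m^{1 - p log a}. Taking logarithms to base 2 is an assumption of the sketch (the text does not fix the base), and the function names are hypothetical.

```python
import math


def failure_bound(m, a, p):
    """Union bound m * (1/a)**(p * log2(m)) on some processor still failing."""
    rounds = p * math.log2(m)
    return m * (1.0 / a) ** rounds


def closed_form(m, a, p):
    """The same quantity written as m**(1 - p * log2(a))."""
    return m ** (1 - p * math.log2(a))
```

For instance, with m = 1024, a = 2 and p = 3 both expressions equal 2^{-20}, so 30 rounds already make a straggler very unlikely under this bound.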

3.   The  processor  identification  problem 

3.1. An O(n/log n) parallel time, O(1)-shared memory, Weak-Paracomputer
Algorithm.

We provide here an algorithm which needs only 1 shared memory
location, which can hold only up to O(n) size numbers.^

Let us remark here that if a subset S = {P_{i_1}, ..., P_{i_s}} of the n processors
attempt simultaneously a FA(v,e) on the shared memory location v, and e is
the (common) value of each processor P_{i_k}, then, after the FA operation, the
memory location shall be augmented by the sum s·e. Let us imagine that all the
processors are equipped with the same list L = ℓ_1, ℓ_2, ..., ℓ_m of "testing"
sequences, where each ℓ_i is an n-bit sequence of 0's and 1's. Let us assume
also that L is independent of the particular assignment of 0's or 1's to
processors. In the following, let v be a fixed memory location, and let e_j be
the value of processor P_j. Processors execute the following sequence of steps,
m times.

^In the usual paracomputer model, with shared memory capable of holding O(2^n) size
numbers, one can solve the problem in O(1) parallel time by having processors P_i execute
FA(v, 2^i·e_i) where e_i is the value of P_i, and have each P_i read the result subsequently.

Round i (1 ≤ i ≤ m)

(1) P_1 erases v's contents by reading v and then doing FA(v, -value(v)).

(2) Each processor P_j (1 ≤ j ≤ n) looks in the jth position of ℓ_i. If ℓ_i[j] = 1
and e_j = 1, then P_j executes FA(v, e_j).

(3) Each P_j (1 ≤ j ≤ n) reads v by executing FA(v,0).

At the end of the m rounds, each processor has, for each testing
sequence, the number of places in which a 1 stands both in the testing
sequence and in the sequence e_1 e_2 ... e_n to be guessed. If L allows each
processor to find e_1...e_n after the m rounds, we call L an m-algorithm for the
processor identification problem. (We allow unrestricted local memory per
processor.)

An obvious L (which would take O(n) parallel time) is that consisting of
n sequences ℓ_i with ℓ_i[j] = 0 for j ≠ i and ℓ_i[i] = 1 (1 ≤ i ≤ n).
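
The rounds above, together with the obvious choice of L, can be simulated with the following Python sketch. It collapses each round into plain arithmetic on a single variable v, which is an assumption of the sketch; the names run_rounds and identity_list are hypothetical.

```python
def run_rounds(L, e):
    """Return the m values that every processor reads from v, one per round.

    L is a list of m testing sequences (lists of n bits) and
    e is the list of the n processor values (bits).
    """
    n = len(e)
    observed = []
    for row in L:
        v = 0                          # step (1): v is cleared
        for j in range(n):             # step (2): weak FA, every increment is 1
            if row[j] == 1 and e[j] == 1:
                v += e[j]
        observed.append(v)             # step (3): all processors read v
    return observed


def identity_list(n):
    """The obvious L: sequence i tests processor i alone."""
    return [[1 if j == i else 0 for j in range(n)] for i in range(n)]
```

With identity_list(n) the observed values are exactly e_1,...,e_n, at the cost of n rounds; a shorter L returns counts that must be decoded.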

Erdős and Rényi [ER,63] considered a very closely related problem, the
"coin-weight" problem. Using their techniques, we show that the m needed is
Θ(n/log n) and that L is constructive (in fact, there is an easy probabilistic way
to find L).

Let us view L as an m × n matrix of 0's and 1's.

Lemma 2 (see also [ER,63]). A matrix L, m × n, of 0's and 1's is an
m-algorithm for the processor identification problem iff: For each pair c, c' of
subsets of the set C of columns of L, such that c ≠ c', if we form the
row-sums of the submatrices L(c) and L(c') (consisting of the selected columns)
and denote by V_c and V_{c'} the column-vectors consisting of these row-sums,
then V_c ≠ V_{c'}.

Proof sketch: After m rounds, each processor has a row-sum vector, V, of L.
This corresponds to just one subset c of the set of columns of L. This subset
determines the processors which have 1's. This is so, because c is exactly the
subset of processors with a value equal to 1.
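
For small n the condition of Lemma 2 can be checked by brute force, which also yields the decoding table from row-sum vectors to assignments. The Python sketch below is only an illustration with exponential cost in n; the names row_sums and decoding_table are hypothetical.

```python
from itertools import product


def row_sums(L, bits):
    """Row-sum vector of the columns of L selected by the 0/1 vector bits."""
    return tuple(sum(l * b for l, b in zip(row, bits)) for row in L)


def decoding_table(L):
    """Map each row-sum vector to its assignment, or return None.

    None means two different column subsets share a row-sum vector,
    so L is not an m-algorithm in the sense of Lemma 2.
    """
    n = len(L[0])
    table = {}
    for bits in product((0, 1), repeat=n):
        key = row_sums(L, bits)
        if key in table:
            return None
        table[key] = bits
    return table
```

The identity matrix passes the check trivially, while the single row [1, 1] fails because the subsets {1} and {2} both give the row sum 1.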

Here,  the  techniques  of  [ER,63]   can  be  employed,  to  prove: 

Lemma 3. A matrix L, m × n, m = an/(log n), a ≥ 6, chosen so that the
mn entries are independent random variables each taking on the values 0 and
1 with probability 1/2, is an m-algorithm, with probability tending to 1 as
n → +∞.

Proof sketch (see full paper for details): Let p = Prob{L is an m-algorithm}.
Let q = 1 - p. Let E(c_1, c_2), where c_1, c_2 are subsets of the set of columns C of
L, denote the event that V_{c_1} = V_{c_2} (V_{c_i} is the row-sum vector of L(c_i)). If c_1
and c_2 are not disjoint, then if c_1' is c_1 minus the set c_1 ∩ c_2 and
c_2' = c_2 - c_1 ∩ c_2, then V_{c_1'} = V_{c_2'} whenever V_{c_1} = V_{c_2}. Hence, if L is
not an m-algorithm, there exist disjoint subsets of the set of columns such that
V_{c_1} = V_{c_2}. So

q ≤ Σ Prob E(c_1, c_2)

where the sum is over c_1, c_2 disjoint subsets of C.

By combinatorial calculations, one can get then

q ≤ 2^{n(log 3 - a/2) + o(n)}

where m = an/log n.

If we choose a > log_2 9 + 2 then

q ≤ 2^{-n} and lim q = 0.

Lemma 4. There exists an m-algorithm, for m = an/log n, a ≥ 6.

Proof  sketch  (see  full  paper  for  details):  From  Lemma  3,  since  q  <  1  for  such 
m  and  any  n,  then  p  >  0  so,  in  fact,  there  exists  an  m-algorithm. 

3.2.    Lower  bounds  and  Conclusions 

Lemma 5. No m-algorithm can have m < n/log(n+1).

Proof sketch: Each processor needs at least log(2^n) = n "pieces of
information" to distinguish between the 2^n possible assignments of 0's and 1's
to processors. On the other hand, if k processors attempt a FA, then the
amount of information obtained cannot exceed log(k+1) ≤ log(n+1) because
the number of 1's among them is 0, 1, ..., or k. So, m simultaneous FAs
can give each processor at most m log(n+1) pieces of information.
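
As a small worked instance of this counting argument, the following Python sketch computes the smallest integer number of rounds not excluded by Lemma 5. It assumes base-2 logarithms (information measured in bits), and the name min_rounds is hypothetical.

```python
import math


def min_rounds(n):
    """Smallest integer m with m >= n / log2(n + 1)."""
    return math.ceil(n / math.log2(n + 1))
```

For n = 1023 each round can reveal at most log2(1024) = 10 bits, so at least 103 rounds are needed, compared with the 1023 rounds of the obvious L.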

Note:  Lemma  5  gives  us  lower  bounds  in  the  trade-off  between  the  shared 
memory  available  and  the  number  of  parallel  steps  a  weak-paracomputer 
needs  to  solve  the  problem.  Also,  note  that  instead  of  the  weak-paracomputer 
we  could  just  use  a  W-RAM  (we  call  it  the  SW-RAM)  with  the  property  that 
simultaneous  writes  succeed  only  if  they  write  the  same  value  and,  if  that  is 
the  case,  their  sum  is  recorded.   So,  we  get: 

Theorem 2. (A) The processor identification problem can be solved by a
weak-paracomputer in O(n/log n) parallel time, n processors and O(1) shared
memory, and this is best possible.

(B) The same problem can be solved by n^ processors in O(...) time
and O(s) shared memory, and this is best possible.

Theorem 3. Any W-RAM algorithm requires at least Ω(n) parallel time
to solve the processor identification problem, by n processors and O(1)
shared memory.

(Proof  in  full  paper). 

and 

Corollary  3.  The  weak  paracomputer  and  the  SW-RAM  are  more 
powerful  models  than  the  W-RAM. 

Note: Once the processors have the list L (and L is an m-algorithm) they can
construct a table of the possible row-sum vectors V_c and their corresponding
subset c of L. Then, given any instance of the identification problem, they
need O(m) parallel steps to find V_c and one (indexed) table access to find c
(and solve the problem). Another piece of the preprocessing work is the
construction of L itself. With O(1) shared memory and a weak paracomputer
model, the n processors will need Θ(n^2/log n) time to agree on a common
(random) L. So, our algorithm becomes practical only in dynamic
environments, where the values of the n processors change. Our algorithm
can "identify" N sequences of 0's and 1's (where N = Ω(2^n)) in preprocessing
time Θ(log^2 N/log log N), local space O(N), parallel time O(log N/log log N)
and shared space O(1).

4.   The  problem  of  approximate  counting 

4.1.    The  problem  and  applications 

We consider a CRCW W-RAM of n processors P_1,...,P_n. Processor P_i is
given a number x_i ∈ {0,1}. Processors know in advance that either the 1's are a
majority (> n/2) or that their number is less than n/log n, but they don't
know which are the P_i's with x_i = 1. We present here a technique allowing
the processors to distinguish between the two situations, in O(1) expected
parallel time, and O(n/log n) shared memory locations. We can restrict the
shared memory so that each location can keep only 1 bit (i.e. O(n/log n)
separately addressable bits of shared memory suffice). In the following, let M
be the array of shared memory locations. One possible application of the
result is parallel connectivity algorithms with a random graph as input.
[ER,60] proved that random graphs with ≥ n edges and n vertices have a
"giant" connected component (of size O(n)) with probability tending to 1 as n
→ ∞. Our techniques then allow the connectivity algorithm to identify the
giant component without actually counting vertices (and hence beat the
Ω(log n) worst case bound in parallel time). In the following, let a ≥ 4 be a
constant and let m = ⌊n/(a log n)⌋ + 1.

4.2. The algorithm

Processor P_i initializes M[i] to 0, 1 ≤ i ≤ m.

(1) Each P_i, with x_i = 1, chooses equiprobably at random an integer y ∈
{1,2,...,m-1}.

(2) Each P_i, with x_i = 1, writes its x_i to M[y].

(3) Each P_i, 1 ≤ i ≤ m-1, reads M[i]. If M[i] = 0 then P_i writes 1 to M[m].

(4) All P_i read M[m]. If it is 1, then they decide that the number of nonzero
x_i's is < n/log n, else they decide that it is > n/2.
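
The four steps can be simulated sequentially as in the following Python sketch. Its assumptions are labeled here and in the comments: logarithms are taken to base 2, the function name approximate_count and the return labels "few" and "majority" are hypothetical, and concurrent writes of the same value 1 are modeled as plain assignments.

```python
import math
import random


def approximate_count(x, a=4, rng=None):
    """Decide between 'majority' (> n/2 ones) and 'few' (< n/log n ones)."""
    rng = rng or random.Random()
    n = len(x)
    if n < 2:
        raise ValueError("n must be at least 2")
    m = math.floor(n / (a * math.log2(n))) + 1   # assumption: log base 2
    if m < 2:
        raise ValueError("n too small for this choice of a")
    M = [0] * (m + 1)                  # cells M[1..m]; index 0 is unused
    for xi in x:                       # steps (1)-(2): mark a random cell
        if xi == 1:
            M[rng.randint(1, m - 1)] = 1
    for i in range(1, m):              # step (3): report any unmarked cell
        if M[i] == 0:
            M[m] = 1
    return "few" if M[m] == 1 else "majority"    # step (4)
```

With n = 1024 and a = 4 there are 25 markable cells, so an input with only 10 ones always yields "few", since at most 10 cells can be marked.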

4.3.  Analysis  of  the  algorithm 

The algorithm of Section 4.2 clearly uses ⌊n/(a log n)⌋ + 1 separately
addressable units of shared memory (m bits would do) and takes O(1)
parallel time. We now prove:

Theorem 4. Our algorithm decides correctly with probability ≥ 1 - n^{-c},
c > 1 a constant.

Proof: Let t = |{P_i : x_i = 1}|. The algorithm errs in two cases:
Case 1. t > n/2 and M[m] = 1. This means that at least n/2 processors
selected an integer between 1 and n/(a log n) at random and one number was
not selected. The probability that a particular location was not selected is
(1 - (a log n)/n)^{n/2} ~ n^{-a/2}. The probability that there exists a location which
is not selected is, then, upper bounded by

n · n^{-a/2} = n^{-a/2 + 1} = n^{-c_1},

where

c_1 = a/2 - 1 ≥ 1

since a ≥ 4.

Case 2. t < n/log n and M[m] = 0. This means that less than n/log n
processors selected an integer between 1 and n/(a log n) at random and all
integers were selected. The situation is similar to that of throwing N = n/log n
balls at random in N/a boxes and having at the end at least 1 ball per box.
From  [F,58]  the  probability  of  the  above  event  is  asymptotically  e~°'* 
(tending to 0 as n → ∞). The theorem follows.

4.4.  Discussion 

By  repeating  the  algorithm  of  sec.  4.2  we  can  get  an  algorithm  which 
never  errs  and  whose  expected  parallel  time  is  0(1).  A  special  case  of  our 
algorithm  could  give  a  probabilistic,  fast,  n-input  NAND  implementation. 
We  can  refine  the  algorithm  so  that  the  "interval  of  uncertainty"  becomes 
much  less  than  [n/log  n,  n/2]  ,  with  small  penalty  in  the  expected  parallel  time 
(see full paper).